Abstract

We investigate the complexity of bit counting algorithms in different 
sets of instructions. 



1 Introduction 

The operation of counting the number of ones in a binary word consisting of 
n bits is also known as sideways addition [7] or bit count. Solutions based
on look-up tables have been suggested by several authors (see the references in 
[8]), but for large n they are clearly impractical. 

Under the unit cost measure, methods of asymptotic complexity $O(\log \log n)$
are known, but they are based on multiplication, like the one due to Wilkes,
Wheeler, and Gill from the first textbook on programming published in 1957 [7],
or on division, as in [3, Item 169]. The assumption that these complex arithmetical
operations are executed as efficiently as, e.g., addition or logical instructions
appears to be unrealistic. We are therefore mainly interested in approaches to
the bit count problem avoiding these operations, also ruling out shift instructions
as special cases of multiplication and division.

Exercise 2-9 of [5] asks for a solution of the bit count problem based on an
observation originally due to Wegner [9]. It is a remarkable fact that in two's
complement representation for $x \neq 0$ the right-most one of x can be deleted by
the operation x & (x - 1), where & denotes the bitwise and of two values. With
the help of this fact the bit count of the input can be computed in a loop that is
executed once for each one. This saves considerable work in comparison to a
naive implementation when the ones are sparse. When the zeroes are
sparse the method can be applied to the complement (this was already mentioned
in the last section of [5]). The worst case as well as the average case complexity
of these methods is however $\Theta(n)$.
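The loop described above can be sketched as follows. This is a minimal illustration for 32-bit words; the function names `sparse_ones` and `dense_ones` are chosen here and are not taken from [5] or [9].

```c
#include <assert.h>
#include <stdint.h>

/* Counts ones by repeatedly deleting the right-most one with x & (x - 1).
   The loop body is executed once for each one in x. */
int sparse_ones(uint32_t x)
{
    int count = 0;
    while (x != 0) {
        assert(count < 32);  /* a 32-bit word holds at most 32 ones */
        x &= x - 1;          /* delete the right-most one */
        count++;
    }
    return count;
}

/* Variant for words with few zeroes: count the ones of the complement
   and subtract the result from the word length. */
int dense_ones(uint32_t x)
{
    return 32 - sparse_ones(~x);
}
```

For an input such as 0xFFFFFFFE the second variant executes the loop body once, whereas the first executes it 31 times.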

In the current work we will develop algorithms for counting the ones in a
binary word based on the concept of "SIMD within a register (SWAR)" [5]. The
idea is to partition large registers into fields of bits that are processed in parallel.
We obtain algorithms of complexity $O(\sqrt{n})$ and $O(\log^2 n)$ in slightly different
settings, where the latter approach is less practical for current word-lengths, as
it requires many large constants.

Computing the parity of a binary word efficiently has received some attention 
in its own right, see, e.g., [U 0]. It can of course be determined from the 
count of ones, but a specific approach might be faster. For a modified parity 
function we get an $O(\log n)$ solution in a restricted computational model which
is competitive with methods using shift instructions. Also included is a parity
function making use of integer division which for 32 bits is superior (in terms
of "C-operations") to the implementations the author is aware of. It is inspired
by [3, Item 169], which we outline in the Appendix.

2 Preliminaries 

We will denote the binary logarithm by log. The word-length will be denoted by
n, which is usually a power of 2. Bit positions are numbered starting at 0 with
the least significant bit. Thus bits of the words being processed have weights $1$
to $2^{n-1}$. Using the notation from [7], the bit count function is called $\nu$.

The main results refer to programming models that rule out multiplication,
division, and shift instructions. The restricted set of instructions including only
logical operations (and, or, xor) and addition will be called here Oblivious
Parallel Addition and Logical Instructions (OPAL). Including subtraction does
not change this model in terms of asymptotic complexity, since it can be
simulated with negation and addition. OPAL is the model employed in [8]. Code
will be presented in (subsets of) C [5].

With Parallel Addition and Logical Instructions (PAL) we will denote the
extension by flow control instructions like "if" and "while", usually found in
modern programming languages. Within the PAL model the bit count function
from [B] and the improvement of Exercise 2-9 can be realized directly. 

3 Results 

Theorem 1 The bit count function $\nu$ can be computed with the help of $O(\sqrt{n})$
instructions in the PAL model.

Proof: The central idea is to apply Wegner's technique [9] to approximately
$\sqrt{n}$ fields of $\sqrt{n}$ bits each, separated by spacer bits [5]. Since the original input
does not have spacer bits, the first task is to count ones at the positions of the 
prospective spacer bits. This number is stored in the variable sum. 

Then the following steps are repeated. First the spacer bits are set. By 
subtracting twice the current most significant bits of each field (plus one for the 
least significant position), a one is deleted from all fields that contain a one. For
fields without a one, the spacer bit is reset by this operation. The difference to
the previous state of the most significant bits of the fields is computed and the 
change is determined with the help of the same technique. The current number 
of "active" fields is stored in the variable count and additional bits reset are 
recorded in sum. The subtraction from an empty field will cause a borrow and 
therefore no explicit subtraction from the next field is necessary. These steps 
are carried out until all fields are empty. 
The number of iterations is bounded above by the length of the fields and
thus $O(\sqrt{n})$. The amortized cost of updating variable count is also $O(\sqrt{n})$,
since each of the $O(\sqrt{n})$ spacer bits can change only once (it will never change
from 0 to 1 within the loop). Therefore the claimed complexity follows.

We present the procedure just outlined for a 32 bit input, where for the
sake of simplicity we round the length and the number of fields to powers of 2: 

#define HIBITSL 0x88888888UL        // spacer bits: 8 fields with 4 bits each

int bitcount(unsigned long a)       // ALU-based, a is a 32 bit input
{
    int sum;                        // overall bit sum
    int count;                      // incremental contributions
    unsigned long x;                // auxiliary variable
    unsigned long oldhi, newhi;     // hi bits of fields

    x = (a & HIBITSL);              // ones at the spacer positions
    for (sum = 0; x; x &= (x - 1))
        sum++;

    count = 8;                      // bitcount(HIBITSL)
    oldhi = HIBITSL;
    a |= oldhi;                     // set the spacer bits
    while (oldhi)
    {
        a &= (a - oldhi - oldhi - 1);   // delete a one from each active field
        newhi = (a & oldhi);
        x = newhi ^ oldhi;              // spacer bits reset in this round
        while (x) {
            x &= (x - 1);
            count--;
        }
        oldhi = newhi;
        sum += count;
    }
    return (sum);
}

□ 

Next we exclude flow control (if, while) from the operations allowed. By
the result from [8], function $\nu$ cannot be computed under this restriction. We
can however compute a modified version of $\nu$ for a portion of the bits.

First we describe an important building block that allows us to shift bits 
rapidly using only addition and logical operations. 

Lemma 1 In a set of disjoint fields, the least significant bit of each field can 
be shifted to the most significant position in constant time, if all intermediate 
positions contain zero. 

Proof: We will describe the technique for a single field stretching from position
i to position j > i. By incorporating information for the other fields into the
constants, the corresponding modifications can be achieved.

First bit j is cleared by masking with $\neg 2^j$. Then the value $2^j - 2^i$ is added
to the result. Finally we mask with $\neg(2^j - 2^i)$.

If bit i initially was 1, then there will be a carry that propagates to position
j. If it was 0, then position j will not be modified. □
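The three steps of the proof can be sketched for a single field of a 32-bit word. The function name `lift_bit` and its parameters are illustrative. The constant $2^j - 2^i$ is built with shifts here only to keep the sketch generic; in the restricted models it would be a literal, as in the listings of this section.

```c
#include <assert.h>
#include <stdint.h>

/* Moves bit i of x to position j, where i < j < 32.
   Precondition: bits i+1 .. j-1 of x are zero.
   Bits outside the range i .. j are left unchanged. */
uint32_t lift_bit(uint32_t x, unsigned i, unsigned j)
{
    assert(i < j && j < 32);
    uint32_t ones = ((uint32_t)1 << j) - ((uint32_t)1 << i);  /* 2^j - 2^i */

    x &= ~((uint32_t)1 << j);  /* clear bit j */
    x += ones;                 /* a one at bit i carries up to bit j */
    x &= ~ones;                /* clear positions i .. j-1 */
    return x;
}
```

With i = 1 and j = 3 the added constant is 6, which is the per-field pattern of the constant 0x66666666 used in the parity listing for Theorem 3.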

We note that the technique of Lemma 1 can replace the bit-by-bit shift of
bits in the construction from [8], which has a time complexity proportional to
the number of positions.

Theorem 2 Let $m = n - \lceil \log n \rceil$. The modified bit count function
$2^{m-1}\nu(x \bmod 2^m)$ for the lowest m bits of an n bit input x can be computed with the help of
$O(\log^2(n))$ instructions in the OPAL model.

Proof: We will demonstrate how to compute the function for the lower n/2
bits of a word with n bits. The claim then follows by computing in addition
the function for the most significant $n/2 - \lceil \log n \rceil$ bits (shift all constants
accordingly), adding the results and subtracting the count of ones in the overlap
by first shifting each bit to the target area and then subtracting it. The latter
correction can be done in time $O(\log n)$ by Lemma 1.

In stage i of the method, counts of ones in fields of length $2^{i-1}$ are combined
into counts for fields of length $2^i$ by applying Lemma 1 to the bits of the counts.
Since by Lemma 1 we can only shift left, the counts are located at the most
significant bit of fields.

There are $O(\log n)$ stages and each stage takes time $O(\log n)$, resulting in
the claimed bound $O(\log^2(n))$.

The following code implements the method for the lower 16 bits of a 32 bit 
word. Binary representations of the masks are given in the comments.

// least significant 16 bits counted, 20 bits required
// (excess 4 bits for 5 bit count)
// bits 16 and above of x are assumed to be zero on entry
unsigned bitcount(unsigned x)
{
    unsigned y;

    y = x & 0x5555;     // 0101010101010101
    x -= y;             // delete fields
    y += 0x5555;        // 0101010101010101
    y &= 0xAAAA;        // 1010101010101010
    x += y;

    y = x & 0x6666;     // 0000110011001100110
    x -= y;             // delete fields
    y += 0xCCCC;        // 0001100110011001100
    y &= 0x73333;       // 1110011001100110011
    y += 0x6666;        // 0000110011001100110
    y &= 0x79999;       // 1111001100110011001
    x += y;

    y = x & 0x3838;     // 0000011100000111000
    x -= y;             // delete fields
    y += 0x1E1E0;       // 0011110000111100000
    y &= 0x61E1F;       // 1100001111000011111
    y += 0xF0F0;        // 0001111000011110000
    y &= 0x70F0F;       // 1110000111100001111
    y += 0x7878;        // 0000111100001111000
    y &= 0x78787;       // 1111000011110000111
    x += y;

    y = x & 0x780;      // 0000000000000011110000000
    x -= y;             // delete fields
    y += 0x3FC00;       // 0000000111111110000000000
    y &= 0x7C03FF;      // 0011111000000001111111111
    y += 0x1FE00;       // 0000000011111111000000000
    y &= 0x3E01FF;      // 0001111100000000111111111
    y += 0xFF00;        // 0000000001111111100000000
    y &= 0x1FF00FF;     // 1111111110000000011111111
    y += 0x7F80;        // 0000000000111111110000000
    y &= 0xF807F;       // 0000011111000000001111111
    x += y;

    return (x);
}

□ 
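To see where the partial counts end up, the first stage of the listing can be run in isolation. The following is a minimal sketch; the function name `pair_counts` is illustrative, and the input is assumed to fit in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* First stage only: the count of ones in the bit pair (2k, 2k+1) is
   left as a 2-bit number whose least significant bit sits at
   position 2k+1, the most significant bit of the pair. */
uint32_t pair_counts(uint32_t x)
{
    assert(x <= 0xFFFF);        /* only the lower 16 bits are counted */
    uint32_t y = x & 0x5555;    /* even bits */

    x -= y;                     /* delete them from x */
    y += 0x5555;                /* Lemma 1: carry each even bit upwards */
    y &= 0xAAAA;                /* keep the odd positions */
    x += y;                     /* add the two bits of each pair */
    return x;
}
```

For the input 0x3 both bits of the lowest pair are set, and the result 0x4 is the count 2 with its least significant bit at position 1.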

By some pre- and post-processing we can obtain a solution in the PAL model 
of the same asymptotic complexity. Due to the large constant factor it does 
however not appear to be of practical value for small word-lengths. 

Corollary 1 The modified bit count function $\nu(x)$ of an n bit input x can be
computed with the help of $O(\log^2(n))$ instructions in the PAL model.

Proof: We proceed as follows: 

1. The $O(\log n)$ most significant positions not handled by the method of
Theorem 2 are copied to a variable and set to 0 in x (time complexity
$O(1)$).

2. For the $n - O(\log n)$ bits the modified bit count function is computed
according to Theorem 2 (time complexity $O(\log^2 n)$).

3. The resulting $O(\log n)$ bits are transferred one by one to the least
significant positions by extracting each bit with the help of a mask and building
up the result in another variable (time complexity $O(\log n)$).

4. The most significant bits of the original input saved in step 1 are counted
in a naive way and added to the result of step 3 (time complexity $O(\log n)$).

The overall time complexity is dominated by step 2 and thus $O(\log^2 n)$. □
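Step 3 can be sketched for the 16-bit example of Theorem 2, where the 5-bit count occupies bits 15 to 19. The function name `extract_count` is illustrative. Tests with "if" are admissible in the PAL model, and no shift instruction is used.

```c
#include <assert.h>
#include <stdint.h>

/* Transfers the count held at bits 15..19 to the least significant
   positions. Each bit is extracted with a mask and, if it is set,
   its weight is added to the result. */
uint32_t extract_count(uint32_t r)
{
    uint32_t result = 0;

    assert((r & ~UINT32_C(0xF8000)) == 0);  /* only the count bits are set */
    if (r & UINT32_C(0x08000)) result += 1;
    if (r & UINT32_C(0x10000)) result += 2;
    if (r & UINT32_C(0x20000)) result += 4;
    if (r & UINT32_C(0x40000)) result += 8;
    if (r & UINT32_C(0x80000)) result += 16;
    return result;
}
```

One test and one addition are spent per bit of the count, which matches the $O(\log n)$ bound stated for step 3.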

For the parity function only a single bit has to be shifted in an approach as 
in Theorem 2. Therefore we obtain a more efficient solution than for bit count.






Theorem 3 The modified parity function $2^{n-1}(\nu(x) \bmod 2)$ can be computed
with the help of $O(\log n)$ instructions in the OPAL model.

Proof: In the initial stage the bits of the input are moved up one position and 
the parity of pairs of bits is computed by a XOR. By masking out the lower bits 
of each pair, we obtain 2 bit fields containing the parity in the most significant 
bit. 

In the following stages we apply a variant of Lemma 1, where we propagate
the most significant bit of one field directly to the most significant bit of the 
next field. We make use of the fact that in the least significant positions of 
each field spacer bits in the sense of [5] are available after masking the most 
significant bit. Each stage doubles the size of the fields. 

We illustrate the method for 32 bits in the following code: 

unsigned parity(unsigned x)
{
    x ^= (x + x);       // xor each bit with its lower neighbor
    x &= 0xaaaaaaaa;    // parity of 2 bit fields
    x += 0x66666666;
    x &= 0x88888888;    // 4 bit fields
    x += 0x78787878;
    x &= 0x80808080;    // 8 bit fields
    x += 0x7f807f80;
    x &= 0x80008000;    // 16 bit fields
    x += 0x7fff8000;
    x &= 0x80000000;    // all 32 bits
    return (x);
}

□ 

By the result from [8] the OPAL model is not able to directly compute the 
parity function. With the help of a single test a non-zero result can however be 
transformed into 1. We thus obtain: 

Corollary 2 The parity function $\nu(x) \bmod 2$ can be computed with the help of
$O(\log n)$ instructions in the PAL model.
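The single test mentioned before the corollary can be sketched as follows for 32 bits. The names `parity_high` and `parity_pal` are illustrative; `parity_high` repeats the steps of the listing for Theorem 3 so that the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* OPAL part: returns 2^31 * (nu(x) mod 2). */
uint32_t parity_high(uint32_t x)
{
    x ^= (x + x);
    x &= UINT32_C(0xaaaaaaaa);  /* parity of 2 bit fields */
    x += UINT32_C(0x66666666);
    x &= UINT32_C(0x88888888);  /* 4 bit fields */
    x += UINT32_C(0x78787878);
    x &= UINT32_C(0x80808080);  /* 8 bit fields */
    x += UINT32_C(0x7f807f80);
    x &= UINT32_C(0x80008000);  /* 16 bit fields */
    x += UINT32_C(0x7fff8000);
    x &= UINT32_C(0x80000000);  /* all 32 bits */
    return x;
}

/* PAL part: one test turns a non-zero result into 1. */
int parity_pal(uint32_t x)
{
    uint32_t r = parity_high(x);

    assert(r == 0 || r == UINT32_C(0x80000000));
    if (r)
        return 1;
    return 0;
}
```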

We finally include code for computing the parity function inspired by the 
bit count function from [3] of asymptotic complexity $O(\log \log n)$. The novel
approach is to xor bits first and then count the resulting ones. By reducing 
the number of possible ones, the fields can be smaller than in the more general 
setting of a full bit count. 

More precisely, if we choose a field size of k bits, then it suffices to have

$\lceil n/k \rceil \le 2^k - 2$

in order to be able to do the counting within fields of k bits each. For n = 32
we can choose k = 4 and obtain the following solution, which uses only 7
"C-operations" as opposed to the 8 operations of the "parity of word with a multiply"
from pp.



unsigned parity(unsigned x)
{
    x ^= x >> 1;                // parity of 2 bit fields
    x ^= x >> 2;                // parity of 4 bit fields
    x &= 0x11111111;            // select low bits of 4 bit fields
    return ((x % 0xf) & 0x1);   // apply HAKMEM technique,
                                // return least sign. bit
}

This approach would work for word-lengths up to 56. Two more instructions 
would suffice for up to 2032 bits (with constants adjusted). 
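The remark about longer words can be illustrated by a sketch for 64 bits with field size k = 8. The function name `parity64` and the use of `uint64_t` are assumptions made here and are not part of the original listing. Compared with the 32-bit version, one more xor and one more shift are needed.

```c
#include <assert.h>
#include <stdint.h>

/* Parity of a 64-bit word with 8 bit fields and a division by 255. */
int parity64(uint64_t x)
{
    uint64_t ones;

    x ^= x >> 1;                        /* parity of 2 bit fields */
    x ^= x >> 2;                        /* parity of 4 bit fields */
    x ^= x >> 4;                        /* parity of 8 bit fields */
    x &= UINT64_C(0x0101010101010101);  /* select low bits of 8 bit fields */
    ones = x % 0xff;                    /* sum of the 8 selected bits */
    assert(ones <= 8);                  /* far below 255, so no wrap-around */
    return (int)(ones & 0x1);
}
```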

4 Discussion 

We have obtained several bit counting and parity algorithms for restricted sets 
of operations. Asymptotically the (modified) parity function with complexity 
$O(\log n)$ for the OPAL model is as fast as the known solutions based on
broadword steps [7]. The latter may include shift operations by a constant number
of bit positions. A lower time bound $\Omega(\log n / \log \log n)$ follows from a result in
circuit complexity. The bound is tight for circuits, but it is open whether this is also
true for the models investigated here.

Appendix 

HAKMEM item 169 [3] describes a reduction of bit counting for word-lengths of
at most 62 bits to integer division. In its original form the program is presented
in assembly language for the PDP-6/10 (36 bit architectures), which is not very
well-known nowadays.

We render it here in C (where the variable names a and b correspond to the 
registers of the original program, comments are given separately): 

#define TWOBITS   033333333333
#define THREEBITS 030707070707

unsigned bitcount(unsigned a)
{
    unsigned b;

    b = a >> 1;         // line 1
    b &= TWOBITS;       // line 2
    a -= b;             // line 3
    b >>= 1;            // line 4
    b &= TWOBITS;       // line 5
    a -= b;             // line 6
    b = a;              // line 7
    b >>= 3;            // line 8
    a += b;             // line 9
    a &= THREEBITS;     // line 10
    return (a % 077);   // line 11
}

Note on the PDP-6/10 instructions: Most of them are quite suggestive. Two possible exceptions are:

LDB B, [014300, , A]: Load 35 bits (octal 043) from A with an offset of 1, counting bits from
the right. Store them right-adjusted in B (line 1 of the C program; this cannot be
implemented directly in C and we use the approach mentioned in the comment of [3,
item 169] on the LDB instruction).

SUBB A,B: Subtract and store in both A and B (lines 6 and 7 of the C program).

See [2] for more information on the PDP-6/10 instruction set.

Comments: 

lines 1-6 : Consider an octal digit of variable a composed of three bits x, 
y, and z with weight 4x + 2y + z. Then in line 3 the value 2x + y + z 
is computed and in line 6 the sum x + y + z of the three original bits is 
computed. 

line 7 : This simulates the second transfer of SUBB. 

line 9 : Neighboring groups of octal digits are added in order to compute the 
sum of six bits. Notice that the three resulting bits are able to hold the 
maximum count for 6 bits without a carry to the next octal digit. 

line 11 : Consider the contents of a as a number in base 64 representation. 
Then the digit at position i from the right (starting at 0) has weight 

$64^i = (63 + 1)^i = 63^i + i \cdot 63^{i-1} + \cdots + i \cdot 63 + 1$

and modulo 63 the sum of all base 64 digits is computed. This limits the
admissible word-length to 62 bits.
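The congruence behind line 11 can be checked with a small sketch. The function names `digit_sum64` and `check_mod63` are illustrative, and shifts are used freely since this is only a check and not part of the restricted models.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of the base-64 digits of a. */
unsigned digit_sum64(uint64_t a)
{
    unsigned sum = 0;

    while (a != 0) {
        sum += (unsigned)(a & 077);  /* lowest base-64 digit */
        a >>= 6;                     /* move to the next digit */
    }
    return sum;
}

/* Since every power of 64 leaves remainder 1 modulo 63, a and its
   digit sum always agree modulo 63. */
void check_mod63(uint64_t a)
{
    assert(a % 077 == digit_sum64(a) % 077);
}
```

The remainder equals the digit sum itself only while the sum stays below 63. A digit sum of exactly 63 yields remainder 0, which is why a count of 63 cannot be represented.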